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Abstract 

We have developed a parallel algorithm for radial basis function (rbf) interpolation that exhibits 0{N) 
complexity, requires 0{N) storage, and scales excellently up to a thousand processes. The algorithm uses 
a GMRES iterative solver with a restricted additive Schwarz method (rasm) as a preconditioner and a fast 
matrix-vector algorithm. Previous fast rbf methods — achieving at most 0{NlogN) complexity — were 
developed using multiquadric and polyharmonic basis functions. In contrast, the present method uses 
Gaussians with a small variance (a common choice in particle methods for fluid simulation, our main target 
application). The fast decay of the Gaussian basis function allows rapid convergence of the iterative solver 
even when the subdomains in the rasm are very small. The present method was implemented in parallel 
using the petsc library (developer version). Numerical experiments demonstrate its capability in problems 
of RBF interpolation with more than 50 million data points, timing at 106 seconds (19 iterations for an error 
tolerance of 10~^^) on 1024 processors of a Blue Gene/L (700 MHz PowerPC processors). The parallel code 
is freely available in the open-source model. 

Key words: radial basis function interpolation, domain decomposition methods, GMRES, order-A^ 
algorithms, particle methods, parallel computing 



1. Introduction 

There are innumerable applications in computational science where one needs to perform approximation 
of a function based on finite data. When the data are in a certain sense "scattered" in their domain, one very 
powerful technique is radial basis function (rbf) interpolation. Since the early 1980s, after a comparative 
study of various methods for scattered-data interpolation showed the superiority of rbf methods [25] , a 
large amount of interest and research effort has been invested in this subject. For many years, however, 
the wide applicability of RBF interpolation was hindered by their numerical difficulty and expense. Indeed, 
in their mathematical expression, rbf methods produce a linear system of size equal to the number of 
data points. A direct solution of such systems — requiring 0{N^) operations and 0{N'^) memory usage — 
becomes prohibitive for more than a few thousand data points. Great progress has been made in recent 
years towards alleviating this computational burden. Wc review this progress below. The present work 
aims to position itself as the fastest and most efficient algorithm to date for RBF interpolation. It moreover 
exhibits excellent parallel scalability, and we put it to the test with up to 50 million unknowns. 

In addition to scattered-data interpolation, RBF methods are now included in the broad category of 
meshfree methods for the solution of partial differential equations. The conventional methods for solving 
PDEs rely on a discretization of the domain in terms of a mesh, i.e., a set of inter-connected nodes. In 
meshfree methods, the aim is to provide a means of discretization that relies only on a set of nodes, but with 
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no knowledge of the connectivity among them. One of the motivations is that the generation of a mesh (for 
example, as used in finite difference and finite element methods) can be a cumbersome task, especially for 
problems with complex geometry and moving boundaries. It is often cited that mesh generation can take 
up to 80% of the time required for computational analysis in industrial applications [30]. Indeed, this is the 
greatest hurdle for the wide use of computational mechanics as a design tool. 

Meshfree methods for the solution of pdes can be grouped into two broad categories. One, as already 
mentioned, includes methods based on rbf interpolation. The second is based on the least squares technique. 
To this second class of methods belong element-free Galerkin methods [12], reproducing kernel particle 
methods [40], /ip-clouds methods [20], partition of imity methods [2], and meshless local Petrov-Galerkin 
methods [1]. We will not refer to least squares methods further. They differ from rbf methods in that they 
do not satisfy the interpolation conditions at the data. In certain applications, e.g.^ when it is known that 
the data contain noise, the smoothing quality of least-squares methods may be desirable. We restrict this 
work to the interpolating rbf methods, which have superior approximation quality without smoothing. 

RBF interpolation has a number of favorable characteristics. Madych et al. [41] have shown that multi- 
quadric rbf interpolation schemes converge faster as the dimension increases, and converge exponentially 
as the density of the nodes increases. These favorable characteristics encouraged others to use it as a tool 
for solving PDEs. In the early 1990s, Kansa [32, 33] used rbf interpolation in a global collocation method 
with multiquadrics to solve pdes. He found that rbfs could yield a very accurate solution for parabolic 
and elliptic pdes. This method has been further extended to symmetric collocation by Fasshauer [22], to 
modified collocation by Chen [18] and to indirect collocation by Mai-Duy [42]. 

Amongst the unfavorable characteristics of rbf interpolation, the most severe is the ill-conditioning of 
the linear system it produces. Preconditioning can result in significant clustering of eigenvalues and improves 
the condition numbers of the interpolation problem by several orders of magnitude [8] . Considerable progress 
in the iterative solution of multiquadric and polyharmonic RBF interpolation was made with the development 
of preconditioners using approximate cardinal functions based on a subset of the interpolating points. This 
approach was first proposed by Beatson and Powell [11]. Beatson et al. [8] coupled this method with the 
GMRES iterative solver and a fast matrix-vector algorithm for polyharmonic splines to construct a method 
with 0{N) storage and 0{N \ogN) complexity. Ling and Kansa [39] showed that an approximate least 
squares cardinal function preconditioner can be used to transform an ill-conditioned system arising from 
RBF solution of PDES into a well-conditioned system. Faul et al. [24] used a carefully selected set of q points 
to construct a preconditioner for which they were able to interpolate problems with d = 40 dimensions and 
N = 10, 000 points in a few iterations. Gumerov and Duraiswami [28] used the fast multipole method (fmm) 
to accelerate the matrix- vector multiplication in Paul's method and reduced the calculation cost from C(iV^) 
to O(A^logA^). There have been a few other studies where fast summation methods were used to accelerate 
the matrix- vector multiplications for polyharmonic [10, 44, 27], multiquadric [19, 45], and Gaussian [26, 45] 
basis functions. 

When handling problems with large numbers of data points (say, millions of points), the large amount of 
memory usage can also become a problem. As the problem size grows, parallelization on distributed memory 
architectures becomes necessary. Therefore, domain decomposition methods (ddm) [47] have been gaining 
interest. The idea behind ddm is to divide the considered domain into a number of subdomains and then try 
to solve the original problem as a series of subproblems that interact through the interfaces. Hardy [29] was 
the first to use DDMs for RBF interpolation. Beatson et al. [9] used the multiplicative Schwarz method along 
with a coarse-grid preconditioner, and a fast matrix-vector algorithm for polyharmonic splines. Their code 
requires 0{N) storage and 0{N \ogN) time, and the timings are quite impressive. There have been a few 
related efforts using the truncated multiquadric function [34], multi-zone methods [50], and a comparison 
between overlapping and non-overlapping ddms for matching and non-matching subdomain interfaces [37] . 
Ling and Kansa [38] used the additive Schwarz method and found that increasing the number of subdomains 
not only reduces the operation count but also increases the rate of convergence. However, none of these 
studies mention the parallelizability of these potentially highly-parallel methods, as they all view the DDMs 
merely as a way to deal with the ill-conditioning of the system. 

All work with DDMs mentioned above involved global and conditionally positive-definite rbfs, such as 
multiquadrics and thin-plate splines. For Gaussian basis functions, the choice and effectiveness of the pre- 
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conditioner, DDM, and fast matrix-vector algorithm differ substantially. In fact, for Gaussian basis functions 
with a small standard deviation a, the resulting system is not so ill-conditioned, and preconditioning be- 
comes a straightforward task of truncating the effect of the far field [7] . In such a case, ddms will provide this 
function of preconditioning by localizing the interaction, i.e., making the bandwidth of the matrix smaller. 
Furthermore, the matrix-vector multiplication could be performed by a fast Gauss transform [26] if cr were 
much larger than the inter-node spacing h. However, for problems that have a ~ h, it is faster to perform 
a direct calculation for the neighboring nodes. These specializations for the domain decomposition strategy 
and matrix-vector multiplication have a large positive impact on the performance and, as we show in this 
work, result in a method which has 0{N) storage and 0{N) time. 

An application that will benefit tremendously from fast RBF interpolation using Gaussian basis functions 
with small cr, is the vortex particle method. It is now a well-known fact that particle methods for continuum 
systems generally require some sort of spatial adaptation, and rbf interpolation can be used in a number 
of different ways to handle this. The use of rbf interpolation for spatial adaptation in the vortex particle 
method was proposed and demonstrated in [4] . It was shown that the core-spreading method [35] coupled 
with RBF interpolation was more accurate than the standard particle strength exchange scheme with remesh- 
ing. A parallel version of the vortex particle method with rbf interpolation for spatial adaptation was later 
developed using the petsc library [3], [6]. The parallel vortex code with rbf interpolation was later used 
in a comprehensive study of vortex tripole evolution [5]. The same method was also used in a calculation 
of homogeneous isotropic turbulence in three dimensions, where the results of the vortex particle method 
agreed quantitatively with those of the reference calculation using a pseudo-spectral method with the same 
number of calculation points [51]. However, the calculation time for the RBF interpolation became domi- 
nant, compared to the other parts of the code. Although convergence of the Krylov method was generally 
observed with the parameters chosen in the bulk of experiments that were performed, no investigation was 
pursued at that time on the conditioning and convergence of the RBF interpolation for this particular type 
of basis function. The generality of rbf interpolation allows the possibility to extend this method to other 
meshfree methods — typically the ones that involve convection of the nodes, for which the basis functions 
must have asymptotically local influence (like the Gaussian). 

The main goal of this paper is to develop a highly-parallel rbf interpolation algorithm by using GMRES 
with the restricted additive Schwarz method (rasm) [16] as a preconditioner along with a fast matrix-vector 
multiplication. We demonstrate that, for Gaussian basis functions with small cr, it is possible to eliminate 
global communications completely by exploiting the asymptotic locality of the basis function. This results 
in 0{N /Nprocs) storage and close to 0{N/Nprocs) complexity, where Nprocs is the number of processes. A 
serial version of the present method was investigated by Torres and Barba [48] . Their results show excellent 
convergence rates for different types of data, but they only show a limited portion of what this method truly 
has to offer. Unlike the previous studies on domain decomposition-based rbf interpolation, our focus is on 
the optimum subdivision of the domain, and parallel scalability. 

2. RBF interpolation with domain decomposition 

2.1. Background on radial basis function interpolation 

The technique of radial basis function (rbf) interpolation was introduced as a tool for solving multivariate 
scattered data interpolation problems. There has been a large production of research results in this field, 
with some excellent books being published in recent years [15, 23]. 

The problem of scattered data interpolation is that of approximating a function / at an arbitrary point x, 
when its values are only known on a limited set of points Xj £ R'^, j = 1, N. This is done by representing 
the function / by the interpolant s{x): 

N M 

cf>{\\x-x,\\)+Y,JkPk{S;), (1) 

J=l k=l 

where is the radial basis function, || • || denotes the Euclidean norm, and {pA:(a;)}fcLi is a basis for H^ 
(space of all d-variate polynomials with degree of less than M). Micchelli [43] showed that there is a direct 
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Table 1: Examples of radial basis functions 



Name of rbf 



Gaussian e ^ 

r",!/ e 2N- 1 
log r, 1/ e 2N 

Multiquadric (1 + e'^r'^)", u > 0,1^ (^N 

Inverse multiquadric (1 + e^r'^)" ji' < 



Polyharmonic 



relationship between (j>{r) being positive definite and the function <j>{y/r) being completely monotone. If 
4>{-\/r) is completely monotone, then the resulting system is positive definite on its own and does not need 
to be augmented by polynomials, which is the case for Gaussians and inverse multiquadrics. On the other 
hand, if (— 1)^^(/)(^^^(y^) is completely monotone for some M > 0, then (f>{r) is said to be conditionally 
positive definite of order M , and requires augmentation by polynomials of degree il/ — 1 to become positive 
definite. 

The coefficients Xj and the polynomial functions Pk{x) are chosen to satisfy the fitting conditions. 



along with the additional constraints 



This results in a linear system 



where: 



s{xj)^f{xj), j = l,...,iV 



JV 

'^XjPkixj) = 0, k=l,...,M 




(2) 
(3) 
(4) 



= ^j}^.^^ . {7 - lj}^=N+i , and |/ = /,}^^^ . 

For future reference of Equation (4) we denote the coefficient matrix as A, the solution vector as .t, and 
the right hand side as b. Wc use the symbol O to represent the zero matrix of appropriate size, while the 
symbol is used for the zero vector/scalar. Popular choices for the radial basis function arc summarized 
in Table 1. 



2.2. Restricted additive Schwarz method ('RASMj 

The idea behind domain decomposition methods is to divide the considered domain into a number of 
subdomains and then try to solve the original problem as a series of subproblems that interact through 
the interfaces. A brief explanation is given below to introduce the main concepts of domain decomposition 
methods, for completeness. The key components that will be explained in this section are used to describe 
the parallelization procedure and analyze the results later on in this paper. 

We consider a square domain Q as shown in Figure 1, having a somewhat uniform distribution of calcu- 
lation points within the domain. These calculation points could be the nodes of a finite difference/element 
mesh or they could be particles. Let us divide the entire calculation domain fl, into overlapping subdomains 
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Figure 1: Illustration of Overlapping and Non-overlapping Domains 



fli. Also, let Qi denote the non-overlapping subdomains as shown in Figure 1. The solution of a linear 
system Ax = b for the whole domain can be obtained by sequentially solving the individual overlapping 
subdomains AiXUi = ^sii, where Ai, xq-, and ba- are the sub-elements associated with the domain Qi for A, 
X, and 6, respectively. The values x in the overlapping region can be overwritten when the next subdomain 
is solved. If the values x of the previous subdomains are used to update the solution of the present subdo- 
main calculation, it is called a multiplicative Schwarz method. If each subdomain is solved individually and 
the solution of the entire domain is updated simultaneously at the end of each iteration step, it becomes 
an additive Schwarz method. Furthermore, when the values x outside of the subdomain Qi are discarded 
after the calculation of each subdomain Qi, it is called the restricted additive Schwarz method (rasm) [16]. 
The restricted additive Schwarz method is known to converge faster than the additive Schwarz method and 
requires less communication in parallel calculations [21]. As far as we know, RASM has not been used before 
within RBF interpolation with domain decomposition. 

The mathematical formulae for the above methods can be written quite succinctly by defining a restriction 
matrix Ri, 

xn,^R.x = {I 0)( ) (5) 

V xn\n, ) 

which restricts a vector in the whole domain x to one in the subdomain zq. . If the entire domain has N 
elements and the subdomain has Ni elements, Ri is a. Ni x N matrix that restricts the vector x of size N 
into vector Xii- of size Ni. The transpose of the restriction matrix is the projection matrix, 

X = RJxq,^ (6) 

which projects the vector in the subdomain onto the whole domain. Note that the Ri matrices are usually 
not formed in practice, but are introduced to express the different algorithms in a concise manner. Using 
the restriction and projection matrices, the multiplicative Schwarz method for the n-th iteration step with 
p overlapping subdomains can be written as 

^n+i/p ^ RjA]:^Ri{b~ Ax") 



x 



where Ai ~ RiARf . The n-th iteration step with p overlapping subdomains of the additive Schwarz can be 
expressed as 

p 

= 4- ^ RfA- 1 i?, (5 - Ax" ) (8) 



The only difference between the muhiphcative Schwarz method in Equation (7) and the additive Schwarz 
method in Equation (8) is that the muhiphcative Schwarz method updates the residual after each subdomain 
is calculated, whereas the additive Schwarz method updates the residual only once per iteration. 

Equation (8) can be extended to the restricted additive Schwarz method (rasm) by defining a projection 
matrix 

^ = Rf^n. (9) 

which projects only the values in the non-overlapping subdomain x^^- onto the whole domain. The n-th 
iteration step of the rasm is 

p 

f ^+1 = x'' + J2 R[Ai^R,ib~ Ax^) (10) 

i=l 

where the matrix Ri restricts the residual to the overlapping subdomain ili, solves the problem on f2i, 
and Rf projects the solution in the non-overlapping subdomain fli onto the entire domain. The matrix 

= (11) 
1=1 

can be viewed as a preconditioner, and can be used along with any Krylov subspace method. In the present 
calculation we use the GMRES [46] . 

The calculation of inner products, norms, and scalar multiplication in the GMRES are not computationally 
intensive nor do they require large data transfer. Therefore, the efficiency and parallelism of the GMRES 
calculation depends on how the preconditioning is handled. When the preconditioning is done by rasm, 
the efficiency of the calculation is strongly affected by the size of the overlapping and non-overlapping 
subdomains. One needs to find the balance between having a large number of small subdomains against 
having a small number of large subdomains. Also, increasing the overlap region allows the GMRES to converge 
in fewer iterations, but each iteration will take longer because the direct solve will be performed for larger 
overlapping subdomains. We have performed a thorough investigation of the optimum subdomain size for 
our algorithm, and present the results in §3. 

2.3. Parallelization of RASM 

For the parallelization, we assume a distributed memory model that stores only the local components 
of each vector. Parallelization of the calculation of inner products, norms, and scalar multiplication are 
trivial, and the only noteworthy task when parallelizing the preconditioned GMRES is the preconditioning 
itself. When the RASM is used as the preconditioner, it can be seen that Equation (10) can be readily 
parallelized by assigning each subdomain to a separate process. A schematic of the parallelization of the 
RASM corresponding to Equation (10) is shown in Figure 2(a). The different colors represent separate 
processes. The "rank" refers to the rank of the MPI process. Each process owns a local portion of the 
global vectors x and b. The partitioning of the vectors corresponds to the non-overlapping subdomains 
shown in Figure 1. For the case of one subdomain per process, as seen in Figure 2(a), each process owns 
one submatrix Ai. Each process would require a small amount of communication from other processes to 
calculate Ax". Once this is done, the local residual for the overlapping subdomain ri — hi ~ Ax^ can be 
used as the right-hand-side to solve the local system. Then, the non-overlapping part of the solution to the 
local system is added to the vector af". This is done by simply updating only the local values that belong 
to the process. This procedure is depicted in Figure 2(a). 

For RBF interpolation using basis functions with negligible global effects, the matrix A can be considered 
to have a finite bandwidth. Thus, the calculation of Ax^ can also be done somewhat locally. Let us consider 
a normalized Gaussian basis function in 2D as an example. The matrix A has the following elements: 
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rank = 
rank = 1 



rank = 



rank = 1 



^+1 



+ 



A 



O 



b-Ax' 



(a) One subdomain per process 



A7 



+ 



A, 



A-3' 



a; 



b-Ax^ 



(b) Multiple subdomains per process 

Figure 2: Parallelization strategy of the rasm 



Since the Gaussian function decays rapidly, the elements of matrix A corresponding to the interaction 
of distant points can be neglected. This "sparsity" of A depends on the relative size of the calculation 
domain compared to the standard deviation, ct, of the Gaussian function. For problems with relatively small 
(T, the calculation of Ax^ can be done by a sparse matrix- vector multiplication with controllable accuracy. 
If a is kept constant while the size of the calculation domain is increased along with N, the calculation 
load will scale as 0{N). The communication required to perform Ax^ is also limited to a constant number 
of elements in the vicinity. Therefore, the rasm becomes an extremely parallel algorithm with minimum 
communication for the RBF interpolation using Gaussian basis functions with small cr. 

In the present work, domain decomposition is used not only to save storage, but also to increase the 
speed of convergence. Therefore, it is useful to have independent control over the subdomain size and the 
number of processes, as shown in Figure 2(b). In this case, each process stores multiple sets of .t^ and 6^, and 
loops over the the subdomain solves. The restriction from overlapping subdomains to the non-overlapping 
subdomains is not as straightforward as the case of one subdomain per process, because the local portion of 
the vector x does not necessarily correspond to the values that need to be updated. In the present algorithm, 
we perform the restriction by using a separate index list for the non-ovelapping subdomain. 



2.4- Implementation details 

The implementation of the algorithm developed here has been realized using the Portable, Extensible 
Toolkit for Scientific Computation (petsc) [3]. This is a library developed over many years at Argonne 
National Laboratories^. It is fine-tuned and well-supported, and provides an interface between scientific 
application codes and low-level libraries such as blas, lapack, and mpi. All vectors and matrices are 
distributed among processes by petsc and only the local portion is stored. Data types such as Vec and Mat 
allow the user to create these distributed objects and access them using global indices. The fact that all MPI 
communications happen "under the hood" allows users to manage parallel codes (almost) as if they were 
serial codes. Another feature of petsc is that it can switch between different Krylov subspacc solvers and 
preconditioners with just one command line option. 



http:/ /www. mcs.anl.gov/petsc/ 



7 



The current version of petsc (3.0.0, released December 2008) provides routines for domain decomposition 
using additive Schwartz methods. The goal of these routines seems to have been using domain decomposition 
for parallelization only, and thus they were written to hold one subdomain per process. To fit our needs, 
one of the authors (Matthew G. Kncplcy, who is a member of the petsc developer team) has extended 
the PETSC functions to handle multiple subdomains per process, and thus we are able to implement the 
restrictive additive Schwarz method as used in our algorithm. This extension to petsc (currently only 
available in the developer version) enabled us to implement our rbf interpolation algorithm with excellent 
parallel scalability. 

In our implementation, we use the following features of the petsc library. 

> vectors (for vectors x and b) 

> ghost vectors (for overlapping subdomains and matrix-vector multiplication) 

> index sets (for setting overlapping and non-overlapping subdomains) 

> shell matrices (for matrices A and M) 

> KSP linear solvers (gmres, rasm) 

The parallel GMRES-RASM method that has been described earlier in this section can be implemented in 
PETSC using the above features. The vectors are distributed among the processes and the local storage is 
N/Nproc for each vector, where N is the number of elements and Nproc is the number of processes. The 
calculation of inner products, norms, and scalar multiplication are done by calling petsc routines. The ghost 
vectors handle the non-local values that are necessary for the direct solve for the overlapping subdomains 
and also the matrix vector multiplication of Ax. One scatters the global vector into a local "ghosted" 
work vector, performs the computation on the local work vectors, and then scatters back into the global 
solution vector. The index sets are global indices that are used to define the elements in the overlapping 
and non-overlapping subdomain. Although their numbering is global, the index sets are distributed among 
the processes in the same way as the vectors. The shell matrices allow one to calculate the matrix vector 
multiplication Ax without actually storing the matrix A. For the preconditioner A4 , only the small matrices 
for the subdomain Ai are formed. The matrix Ai is then used to calculate Equation (10). After all these 
objects are defined, petsc finally calculates the linear system solution using the KSPSolveO command, 
which takes the solver type gmres (default) and the preconditioner type rasm from the command line 
options. 

3. Parametric study of convergence rate 

The rate of convergence, calculation time, storage requirements, and communication requirements of the 
present method depend strongly on five independent parameters. Figure 3 illustrates these parameters as 
they refer to the domain decomposition choices, and basis function choices. In this section, we present the 
results of a non-dimensionalized parametric study to demonstrate the sensitivity of the convergence rate 
and calculation time to these parameters. 

All the calculations presented here were obtained on the Cray XT4 machine on the hector system at 
the UK National Supercomputing Service ^. hector offers a total of 5664 AMD 2.3 GHz quad-core Optcron 
processors. The system offers 8 GB of memory per processor, which is shared between its four cores. The 
processors are connected with a high-bandwidth interconnect using Cray ScaStar2 communication chips. 
Only one processor of this machine was used for obtaining the results in this section — parallel results are 
presented in the next section. 



^http:/ /www. hector. ac.uk/ 
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Figure 3: Illustration of parameters in the rasm for rbf interpolation using Gaussian basis functions 



3.1. Calculation conditions 

The whole domain is divided into non-overlapping subdomains and overlapping subdomains Q.i as 
shown in Figure 3. We define B as the size of the non-overlapping subdomain and D as the size of the 
overlapping subdomain. We also define an additional subdomain with size T, which defines the non-zero 
entries for the matrix-vector multiplication. Everything outside of this region is neglected when calculating 
Ax for VLi. In the magnified circle in Figure 3, an illustration of the Gaussian basis functions is shown, where 
h is the average distance between the calculation points and a is the standard deviation of the Gaussian 
distribution. When a is large compared to /i, the matrix A becomes ill-conditioned. In other words, the 
more global the basis functions are, the slower the convergence. On the other hand, if the locality was 
perfect and changing the coefficient \j in Equation 4 affects only the data at that point fj , the linear system 
A would be diagonal, i.e. perfectly conditioned, but would suffer from poor approximation properties. 

The following function by Franke is a standard test function for 2D scattered data fitting [25] : 

^ 2 5 

We will use this function as a test case throughout the present paper. The points are distributed on a square 
lattice on the domain [0,1]^. We also used quasi-scattered data, by shifting the position using a random 
number between and h/2, where h is the average distance between points. 

3.2. Parametric study of convergence rate 

We first investigate the rate of convergence of the present method for different overlap ratio of the 
Gaussian basis function, h/a. The size of the domain is set from to 1 and a = 0.01, which results in 
N = {2/h + 1)^ = 10, 201 points for h/a = 1.0. The size of the subdomains B is set to 5a, the overlapping 
subdomains are of size D ^ 1.9 x B, and the truncated domain for the mat-vec products has size T = B + Aa 
{T = B + 6a for h/a = 0.8). These numbers are later shown to be the optimal set of parameters that result 
in the fastest calculation time. Upon execution of the code, we passed the following command line options 
to PETSc: 

-pc_type asm -sub_pc_type lu -sub_mat_type dense 
-vecscatter_alltoall -ksp_rtol le-13 -ksp_monitor -log_summary 
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Number of Iterations 



Figure 4: Convergence rate of the GMRES iteration 

The Z2-norm of the residual ||r||2 of the GMRES iterations is plotted against the number of iterations 
in Figure 4. All cases converged to ||7'||2 = 10^^^, but for h/a ~ 0.8 the number of iterations required is 
significantly larger. This is due to the system A becoming more ill-conditioned. Nevertheless, the present 
method converges within 20 iterations for h/a > 0.9. Furthermore, tests using larger N have shown that 
this convergence rate does not change with the problem size. This can be indirectly observed in the 0{N) 
behavior of our method shown in Figure 5(a). Since each iteration takes 0{N) time, the number of iterations 
must remain constant for the entire algorithm to scale as 0{N). The dashed line in Figure 5(a) is the function 
y = O.OOOlSx, which is drawn in order to quantify the scaling exponent. The coefficient 0.00018 depends on 
the hardware, and implementation. The memory usage is plotted against the number of elements in Figure 
5(b). The dashed line is the function y ~ 0.008x, which is drawn in order to quantify the scaling exponent. 
The storage requirements of the present method also scales as 0{N). 

The rate of convergence does not change with the problem size N as long as the ratio between h/a, 
B/a, and D/a remain constant. The maximum number of N that has been calculated is 50,580,544, and 
we discuss this result later in Section 4. Even for this case, the number of iterations it takes to converge to 
||r||2 = 10~^^ remains exactly the same. This is a unique property of the basis function that we are using. 
For multiquadric and polyharmonics, some sort of hierarchical method would be required in order to have 
a constant convergence rate. 

We now investigate the effect of changing B and D, to optimize the performance of the rasm. The 
number of iterations is takes to converge to ||r||2 — 10^^'' is shown in Figure 6 for different i3, D and h/a. 
The different plots are for different values of h/a. The standard deviation of the Gaussian is cr = 0.01 
for all cases, B/a is changed from 3 to 7, while D is changed from 1.55 to 2.2>B. These parameters were 
selected by performing a parameter study for a larger range and determining the optimum values B = 5a 
and D = 1.9B, and then choosing the values close to it to see how the behavior changes. The optimum 
value for T is also determined by performing a parameter study for a larger range, but will not be changed 
here for brevity. The optimum value for T was B + 4a for h/a > 0.9 and B + 6a for h/a = 0.8, regardless 
of B and D. 

It is useful to view Figure 6 in comparison with Figure 7, where the calculation time is shown instead of 
the number of iterations. Keeping B/a constant and increasing D results in the same number of subdomains, 
but larger overlapping subdomains, so each direct solve for the subdomain takes longer. Since the number of 
subdomains remains constant, the total calculation time to go through all subdomains increases proportional 
to each direct solve. On the other hand, a larger overlapping subdomain gives a better convergence rate so 
fewer iterations will be required to achieve the same accuracy. Thus, there exists an optimum value of D 
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for which the calculation time per iteration is balanced with the required number of iterations to yield the 
best calculation time. Similarly, increasing B/a results in fewer number of subdomains, but the size of each 
subdomain will increase along with the time it takes to solve for it. Thus, there exists an optimum value of 
B/a for which the calculation time per subdomain is balanced with the number of subdomains. It can be 
seen from Figure 7 that the optimum set of parameters is B — and D = 1.9B. This optimum set of B 
and D seems to be consistent throughout the present range oi h/a. 

4. Scaling of the parallel implementation 

There are two common measures for evaluating the performance of parallel codes: the strong scaling 
and the weak scaling. Strong scaling reflects how the solution time varies with the number of processes for 
a fixed total problem size. Weak scaling, in contrast, shows how the solution time varies with the number 
of processes for a fixed problem size per process. Throughout this paper, we strictly differentiate between 
processors and processes. Modern processors have multiple cores and can run multiple MPI processes without 
loss of performance. Thus, the proper unit for parallelization is the MPI process and not the processor. 

In the previous section we showed that our algorithm scales as 0{N) in a serial implementation. In this 
section, we demonstrate that in parallel our algorithm and implementation scale as 0{N / Nprocs) by keeping 
N constant and changing Nprocs (strong scaling), and keeping N/ Nprocs constant while changing Nprocs 
(weak scaling). 

4-1. Strong scaling 

For the parallel calculations in this section we used the Blue Gene/L machine at the Center for Compu- 
tational Science (CCS), Boston University ^, along with the Cray XT4 machine mentioned in the previous 
section. The Boston University Blue Gene is a single rack system and has 1024 nodes with dual core 32-bit 
PPC440 processors (700 MHz) and 512 MB of main memory''. 

In the present rbf interpolation method using Gaussian basis functions with small cr, the problem is 
highly parallel. This is so because in addition to the fact that the rasm itself is highly parallel, the matrix 
vector multiplication for Ax can be done using only the information within the influence distance of the 



^http: / /ccs. bu.edu/ 

*See http;/ /scv. bu.edu/computation/bluegene 
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Figure 10: Strong scaling and memory usage for different number of processes. 



Gaussian function {T in Fig. 3). Note that this is only possible for rapidly decaying basis functions, where 
the matrix A is banded. Fast summation methods for global basis functions will involve more communication 
and may not scale as well for parallel calculations — this is the case in [28] . 

The same calculation as the one shown Figure 7 was performed in parallel. The results for 1, 2, 4, and 
8 processes are shown in Figure 8. Note that the results in Figure 8(a) are the same results as the one in 
Figure 7(b), but the axis has been rescaled. The axes of the four plots in Figure 8 have been aligned so 
that the speed-up is better represented. The optimum parameter choices oi B = 5a and D = 1.9B do not 
change when the number of processes is increased. The scaling is not quite what we expect it to be because 
the problem size N ^ {2/h+ 1)^ = 12, 544 is rather small, causing the otherwise insignificant serial parts of 
the program to degrade the scalability. 

The results shown so far are for equally spaced data points on a square lattice. We now consider the 
effect of having scattered data points. The same calculation as in Figure 8 is performed for the same number 
of scattered data points. The results are shown in Figure 9. For scattered data, the optimum size of the 
B, D, and T domain sketched in Figure 3 changed slightly to B = 6a, D = 1.9 B, and T = B + 6a for 
h/a — 0.9. Compared to the case using a square lattice, the optimum value for B increases by a and T 
increases by 2a, while D/B remains the same. The domain must be slightly larger than the square lattice 
case, because the random positioning of data points results in a slightly more ill-conditioned system. The 
calculation time also increases due to the number of iterations increasing. However, in our target application 
we use RBF interpolation to interpolate from scattered points onto an equally-spaced lattice distribution. 
Therefore, the performance of the interpolation onto scattered data points is not of primary importance. 
We will note that the difference is small, and proceed with our analysis for the parallel performance using 
the data points on a square lattice. 

We now select the optimum parameter B = 5a and D = 1.9-B and test the speed-up for a larger number 
of processes. The calculation is performed on the Cray XT4 used above, and also a Blue Gene/L. The 
strong scaling results are shown in Figure 10(a). Preliminary calculations have shown that smaller h/a gives 
slightly better scaling, but only the results for h/a ^ 0.9 will be shown here. Also, a is set to 0.005 for which 
case N = 49, 729. The memory limitation on the Blue Genc/L 512 MB /node prevented the strong scaling 
tests for larger calculations. The parallel efficiency using for 128 processes is 84% for the Blue Genc/L and 
36% for the Cray XT4, using this test problem. 

The memory usage for the strong scaling tests in shown in Figure 10(b). The dashed line is the function 
y = 350/.T. It can be seen that the memory usage decreases inversely proportional (or even better) to the 
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number of processes Nprocs- Thus, it is safe to say that we have a storage requirement of 0{N /Nprocs)- 
For the weak scaling, the 0{N/ Nprocs) storage of our method allows the calculation of large-scale problems. 
This will be shown in the next subsection. 

4-.2. Weak scaling 

The weak scaling on Blue Gene/L is shown in Figure 11(a). The parallel efhciency is the speed-up divided 
by the number of processes. Parallel efficiency equal to unity means that the parallel implementation scales 
perfectly with the serial implementation. The number of calculation points grows from 49, 729 (1 process) 
to 50,580,544 (1024 processes). Even under the contraints of 512 MB /node, the 0{N /Nprocs) storage of 
our code enabled the calculation of large problems. The weak scaling is almost perfect until 128 processes 
and gradually declines after that. The parallel efhciency for 1024 processes is 78%. 

The breakdown of the calculation time for the weak scaling test is shown in Figure 11(b). "Initializa- 
tion" corresponds to the allocation and initialization of all variables, "PCSetUp" is the setup of the RASM 
preconditioner, "MatMult" is the matrix-vector multiplication, "MatSolve" is the matrix solver for the sub- 
domains, "VecMDot" is the dot product of vectors, and "Other" is the total of everything else. The total 
calculation time is the sum of all colors, which is around 80 seconds for 1 process and around 106 seconds for 
1024 processes. The curve in Figure 11(a) is a vertical flip of the curve at the top of Figure 11(b). Fig. 11(b) 
gives a clear view of which part of the code is degrading the weak scaling performance. It is mainly the 
communication time in "PCSetUp" and "VecMDot" that is increasing the overall calculation time. Further 
tuning of these communication routines could allow the present code to scale up to tens of thousands of 
processors. However, such tuning is too hardware-specific to be discussed in the present paper. 



5. Conclusions 

This work positions itself among a series of recent efforts to provide fast algorithms for the solution 
of radial basis function (rbf) interpolation problems. The subject has gained much momentum in recent 
years, due to growing enthusiasm over meshfree methods, reflected in new books, such as [23], and various 
international conferences dedicated to the topic. Meshfree methods are attractive due to the infamous 
hurdle of mesh generation and maintenance for many applications, especially those involving complex or 
moving geometries. Moreover, RBF interpolation is an important technique for approximation of unknown 
functions based on unordered data. Applications include solution of partial differential equations [32, 33], 
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reconstruction of surfaces for design of medical implants based on imaging [17], interpolation of geophysical 
data [13], modeling of groundwater contaminant transport [36], to name just a few. 

In view of the multitude of important applications, the effort to offer computational efficiency has been 
strong. There have persistently been two challenges for this effort to succeed: the need to solve a dense 
and ill-conditioned linear system, and the cost of evaluating the rbf interpolant once a solution is found. 
This last challenge impacts on the first when using iterative methods of solution, as the internal iterations 
require an RBF evaluation (a dense matrix- vector multiplication). Until recently, the only way to address 
these challenges was to opt for using basis functions of compact support [49, 14]. The development of such 
compact support bases had a significant influence. It is known, however, that the approximation properties 
of compact support bases are inferior to global bases [15, p. 150]. For this reason, several authors have 
continued to investigate ways to provide fast rbf interpolation with global bases. 

Our contribution consists of an RBF interpolation algorithm with 0{N) complexity, and 0{N) storage. 
Moreover, we have formulated the algorithm in parallel. As far as we could find in the published literature, 
the only other parallel rbf method was presented in [31]. That work used polyharmonic basis functions, and 
parallel efficiency was rather modest; the largest problem reportedly solved involved 20, 000 interpolation 
points. 

Two other rbf algorithms of note are the one developed by Beatson and others [9] and the accelerated 
version by Gumerov and Duraiswami [28] of a method developed in [24]. Both of these were implemented 
for serial computations only (as far as we are aware), and both have 0{N\ogN) complexity. In the first 
case, a multiplicative domain decomposition approach was used with polyharmonic basis functions. The 
numerical demonstrations there represent the largest problem size we could find reported, with 5 million 
interpolation points in 2D. In the second case, a preconditioner based on approximate cardinal functions 
is applied [11, 8], and both multi-quadric and polyharmonic bases are used. There, the largest calculation 
reported interpolated 1 million data points in 2D. As the work in [28] is very recent, the hardware used is 
comparable to the one used in our results. The reported timing for 1 million points was over 6000 seconds 
(in one 3.2 GHz processor). Although the basis function used is different, as well as the method of solution, 
a casual comparison can be made with our results. As shown in Figure 5(a), we have timed a solution with 1 
million points at about 200 seconds (with a slower 2.3 GHz processor). Moreover, the accuracy we imposed 
in our experiment was higher by 7 orders of magnitude. 

In light of the publications cited above, we are persuaded that the present work positions itself as the 
fastest and most efficient rbf algorithm to date. We aim to multiply the impact of our algorithm by realizing 
an implementation that is based on a well-tuned, well-supported parallel library for scientific computing, 
PETSc, and by offering the software free of charge, and in the open model. 

In summary, our algorithm and its implementation have demonstrated good asymptotic behavior in the 
following aspects: 

> Complexity: 0{N) for the observed range of = 10^ — 10^ 

> Storage: 0{N) for the observed range of A^ = 10^ — 10^ 

> Convergence Rate: 0(1) for the observed range of A^ (less than 20 iterations for most cases) 
£> Strong scaling: parallel efficiency of 84% on 128 processes for A^ = 49, 729 

> Weak scaling: parallel efficiency of 78% on 1024 processes for N/Nprocs ~ 49, 729 

> Parallel storage: 0{N/Nprocs) for the range Np^ocs = 1 — 128 

The largest rbf interpolation problem we have calculated so far had over 50 million data points, and this 
calculation was timed at 106 seconds (19 iterations, for an accuracy of 10~^^) using 1024 processors on a 
Blue Gene/L supercomputer. Considering the fact that the processors of the Blue Gene/L are significantly 
slower than the high-end CPUs available today (about 5 times) and have only 512 MB /node of memory, 
the performance is quite remarkable. 

For information about downloading the parallel software, which has been dubbed petRBF (for 'parallel, 
extensible toolkit for RBF interpolation), please visit the website at http://barbagroup.bu.edu/. 
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